Running jobs using multiple GPUs#
Running NVIDIA Modulus on Sunbird#
The location of the Apptainer image can be found with module display:
[s.1915438@sl1 ~]$ module display modulus/22.07
-------------------------------------------------------------------
/apps/local/modules/tools/modulus/22.07:
module load apptainer/1.0.3
module-whatis add NVIDIA Modulus to PATH environment variables
setenv MODULUS_IMG /apps/local/tools/modulus/22.07/modulus_apptainer/modulus.img
-------------------------------------------------------------------
The setenv line tells us that the environment variable MODULUS_IMG points to /apps/local/tools/modulus/22.07/modulus_apptainer/modulus.img. We can therefore start a shell in the Apptainer image while bypassing the default volume binds and environment variable exports:
apptainer shell --nv --contain --cleanenv --bind "$(pwd)":/data,/tmp:/tmp $MODULUS_IMG
This works fine with one GPU. To use multiple GPUs, the job must be launched with mpirun.
The command is mpirun -np followed by the number of GPUs. For example, with 2 GPUs:
Apptainer> mpirun -np 2 python ldc/ldc_2d.py
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
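The value passed to -np can be derived from the allocation instead of being typed by hand. The following is a minimal sketch, assuming CUDA_VISIBLE_DEVICES holds a comma-separated list of device IDs; count_visible_gpus is a hypothetical helper name, not part of Modulus or Apptainer.

```shell
# count_visible_gpus: hypothetical helper (not part of Modulus or Apptainer).
# Prints the number of comma-separated entries in its first argument,
# e.g. the value of CUDA_VISIBLE_DEVICES. An empty argument counts as 0.
# The body runs in a subshell so the IFS and noglob changes do not leak.
count_visible_gpus() (
    set -f          # disable globbing while splitting
    IFS=','
    set -- $1       # split the list on commas into positional parameters
    echo "$#"
)

# Example use (commented out so the sketch has no side effects):
# NGPUS=$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")
# mpirun -np "$NGPUS" python ldc/ldc_2d.py
```

This only gives a meaningful result inside the container if CUDA_VISIBLE_DEVICES has actually been passed in; with --cleanenv and no explicit export it is empty and the helper prints 0.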
Problem with mpirun (SKIP IF NEEDED)#
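We need to export $CUDA_VISIBLE_DEVICES inside the Apptainer image, because --cleanenv removes it from the container environment. Otherwise, if \(n\) GPUs are allocated, mpirun will use the first \(n\) GPUs on the node rather than the allocated ones. For example, here GPUs 2 and 3 were allocated, yet nvidia-smi shows the python processes on GPUs 0 and 1.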
[s.1915438@scs2043 examples]$ echo $CUDA_VISIBLE_DEVICES
2,3
[s.1915438@scs2043 ~]$ nvidia-smi
Mon Sep 12 01:20:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:27:00.0 Off | 0 |
| N/A 53C P0 231W / 250W | 5487MiB / 40960MiB | 57% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:28:00.0 Off | 0 |
| N/A 56C P0 94W / 250W | 5487MiB / 40960MiB | 71% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:43:00.0 Off | 0 |
| N/A 45C P0 47W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:44:00.0 Off | 0 |
| N/A 45C P0 45W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-PCI... On | 00000000:A3:00.0 Off | 0 |
| N/A 40C P0 47W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-PCI... On | 00000000:A4:00.0 Off | 0 |
| N/A 40C P0 47W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-PCI... On | 00000000:C3:00.0 Off | 0 |
| N/A 39C P0 48W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-PCI... On | 00000000:C4:00.0 Off | 0 |
| N/A 39C P0 46W / 250W | 2MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2282 C python 5485MiB |
| 1 N/A N/A 2283 C python 5485MiB |
+-----------------------------------------------------------------------------+
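The mismatch above can be checked mechanically. The sketch below compares the allocated GPU list with the GPU indices that show compute processes, and prints any index that lies outside the allocation. gpus_outside_allocation is a hypothetical helper, and the list of busy GPUs is assumed to be read off the nvidia-smi process table by hand.

```shell
# gpus_outside_allocation: hypothetical helper.
#   $1: allocated GPUs, comma-separated (host value of CUDA_VISIBLE_DEVICES)
#   $2: GPUs that show compute processes, comma-separated
# Prints, one per line, every busy GPU that is not part of the allocation.
# The body runs in a subshell so the IFS and noglob changes do not leak.
gpus_outside_allocation() (
    allocated="$1"
    set -f
    IFS=','
    for gpu in $2; do
        case ",$allocated," in
            *",$gpu,"*) ;;        # inside the allocation: nothing to report
            *) echo "$gpu" ;;     # outside the allocation: misplaced process
        esac
    done
)

# With the values from the output above (allocated 2,3 and busy 0,1):
# gpus_outside_allocation "2,3" "0,1"
```

An empty result means every busy GPU belongs to the allocation. Note that on a shared node, processes on other GPUs may also belong to other users, so a non-empty result is a hint rather than proof.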
Use mpirun#
We need to export $CUDA_VISIBLE_DEVICES into the container so that mpirun uses the allocated GPUs. Once it is exported, the python processes run on GPUs 2 and 3:
[s.1915438@scs2043 ~]$ nvidia-smi -i 2,3
Mon Sep 12 01:29:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 2 NVIDIA A100-PCI... On | 00000000:43:00.0 Off | 0 |
| N/A 63C P0 112W / 250W | 5487MiB / 40960MiB | 66% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:44:00.0 Off | 0 |
| N/A 62C P0 206W / 250W | 5487MiB / 40960MiB | 82% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 2 N/A N/A 3255 C python 5485MiB |
| 3 N/A N/A 3256 C python 5485MiB |
+-----------------------------------------------------------------------------+
Best Practice#
Allocate the resources
salloc --nodes=1 --account=scw1901 --partition=accel_ai --gres=gpu:2 --nodelist=scs2043
Switch to the compute node
srun --pty bash
Load NVIDIA Modulus 22.07
module load modulus/22.07
Start the container with --env for CUDA devices
apptainer shell --nv --contain --cleanenv --bind "$(pwd)":/data,/tmp:/tmp --env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES $MODULUS_IMG
Check that $CUDA_VISIBLE_DEVICES was successfully passed into the container.
Apptainer> env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=2,3
Run an example.
Apptainer> cd /data/ldc/
Apptainer> ls
conf conf_zeroEq ldc_2d.py ldc_2d_importance_sampling.py ldc_2d_zeroEq.py openfoam
Apptainer> mpirun -np 2 python ldc_2d.py
Initialized process 0 of 2 using method "openmpi". Device set to cuda:0
Initialized process 1 of 2 using method "openmpi". Device set to cuda:1
[01:28:14] - attempting to restore from: outputs/ldc_2d
[01:28:14] - optimizer checkpoint not found
[01:28:14] - model flow_network.pth not found
[01:28:14] - attempting to restore from: outputs/ldc_2d
[01:28:14] - optimizer checkpoint not found
[01:28:14] - model flow_network.pth not found
[01:28:24] - [step: 0] record constraint batch time: 3.484e-01s
[01:28:40] - [step: 0] record validators time: 1.606e+01s
[01:28:52] - [step: 0] record inferencers time: 1.134e+01s
[01:28:52] - [step: 0] saved checkpoint to outputs/ldc_2d
[01:28:52] - [step: 0] loss: 5.037e-02
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:52] - Reducer buckets have been rebuilt in this iteration.
[01:28:54] - Attempting cuda graph building, this may take a bit...
[01:28:54] - Attempting cuda graph building, this may take a bit...
[01:29:00] - [step: 100] loss: 7.916e-03, time/iteration: 8.412e+01 ms
[01:29:07] - [step: 200] loss: 5.416e-03, time/iteration: 6.566e+01 ms
[01:29:14] - [step: 300] loss: 4.992e-03, time/iteration: 6.685e+01 ms
[01:29:20] - [step: 400] loss: 3.430e-03, time/iteration: 6.476e+01 ms
[01:29:28] - [step: 500] loss: 2.418e-03, time/iteration: 8.192e+01 ms
[01:29:35] - [step: 600] loss: 2.150e-03, time/iteration: 6.622e+01 ms
[01:29:41] - [step: 700] loss: 1.699e-03, time/iteration: 6.515e+01 ms
The nvidia-smi output is as follows:
[s.1915438@sl1 ~]$ ssh scs2043
Last login: Mon Sep 12 01:13:49 2022 from sl1
[s.1915438@scs2043 ~]$ nvidia-smi -i 2,3
Mon Sep 12 01:29:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 2 NVIDIA A100-PCI... On | 00000000:43:00.0 Off | 0 |
| N/A 63C P0 112W / 250W | 5487MiB / 40960MiB | 66% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:44:00.0 Off | 0 |
| N/A 62C P0 206W / 250W | 5487MiB / 40960MiB | 82% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 2 N/A N/A 3255 C python 5485MiB |
| 3 N/A N/A 3256 C python 5485MiB |
+-----------------------------------------------------------------------------+
[s.1915438@scs2043 ~]$
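The interactive steps above can also be wrapped for non-interactive use. The following is a minimal sketch, assuming apptainer exec is used with the same flags as the apptainer shell command in this guide; run_in_modulus is a hypothetical wrapper name. It refuses to start if either CUDA_VISIBLE_DEVICES or MODULUS_IMG is missing, since those are the two values this guide depends on.

```shell
# run_in_modulus: hypothetical wrapper around "apptainer exec".
# Runs the given command inside the Modulus image with the same flags as the
# interactive shell in this guide, and passes CUDA_VISIBLE_DEVICES through
# explicitly because --cleanenv would otherwise drop it.
run_in_modulus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is not set; run this inside a GPU allocation" >&2
        return 1
    fi
    if [ -z "${MODULUS_IMG:-}" ]; then
        echo "MODULUS_IMG is not set; run 'module load modulus/22.07' first" >&2
        return 1
    fi
    apptainer exec --nv --contain --cleanenv \
        --bind "$(pwd)":/data,/tmp:/tmp \
        --env CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" \
        "$MODULUS_IMG" "$@"
}

# Example use on the compute node (commented out so the sketch has no side effects):
# run_in_modulus sh -c 'cd /data/ldc && mpirun -np 2 python ldc_2d.py'
```

Failing early on a missing variable avoids the situation described earlier, where the job starts but silently runs on GPUs outside the allocation.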